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1.  INTRODUCTION 


In  this  report  we  present  a  summary  of  our  successful  efforts  to  optimize/parallelize  the  steady-state 
particle  tracing  (streamline)  algorithm  in  EnSight.  This  project  was  funded  by  the  U.S.  Army  Research 
Laboratory  (ARL),  located  at  Aberdeen  Proving  Ground,  MD,  and  is  considered  completed  as  of  the  date 
of  this  report.  All  modifications  as  a  result  of  this  project  have  been  included  in  the  newest  commercial 
release  of  EnSight,  version  5.5.1. 

2.  BACKGROUND 

A  streamline  visualizes  a  steady-state  vector  field  by  displaying  the  path  that  a  massless  particle  would 
follow  if  placed  in  that  field.  At  each  point  on  the  particle  trace,  the  direction  of  the  trace  is  parallel  to 
the  vector  field  at  that  point  and  time. 

To  create  the  streamline,  the  user  specifies  a  starting  location,  known  as  an  emitter  point  Given  this 
emitter  point  EnSigJit  performs  an  integration  through  the  vector  field  using  a  fourth-order  Runge-Kutta 
algorithm  with  varying  time  stqt.  After  initialization  (where  EnSight  creates  a  table  describing  which 
computational  elements  are  neighbors),  the  streamline  generation  is,  in  theory,  a  parallel  task  as  each 
particle  path  is  independent  of  any  other  path. 

If  the  streamline  generation  can  be  performed  quickly,  the  user  has  the  ability  to  look  at  the  particle 
paths  in  "real  time,"  meaning  he  or  she  can  move  the  emitter  location  via  the  mouse  and  get  an 
"iiTunediate"  update  of  the  particle  path  based  on  the  new  emitter  location.  Users  benefit  greatly  from  this 
real-time  ability  since  they  can  "interact"  with  the  data  set  Since  the  5.5  release,  EnSight  has  the  ability 
to  create  and  di^lay  streamlines  in  real  time  as  long  as  the  data  sets  are  relatively  small. 

A  significant  number  of  data  sets  at  ARL  fall  into  the  "large"  (and  getting  larger)  realm.  In  fact,  just 
to  calculate  a  small  number  of  streamlines  in  these  data  sets  once  can  require  many  seconds,  effectively 
eliminating  the  ability  to  produce  real-time  streamlines. 

In  the  spring  of  1995,  ARL  installed  a  Silicon  Graphics  Power  Challenge  Array  containing  8  boxes 
of  12  R80(X)  CPU’s  and  2  GB  of  memory  each  (for  a  total  of  96  CPU’s  and  16  GB  of  memory),  providing 
a  significant  opportunity  to  develop  and  test  parallel  applications.  With  funding  from  ARL,  Computational 
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Engineering  International,  Inc.,  has  spent  the  past  few  months  optimizing  and  parallelizing  the  EnSight 
streamline  generator  wifli  the  hope  that  the  performance  gains  would  be  sufficient  to  allow  good 
interactivity  with  "large"  data  sets — a  benefit  not  only  to  ARL,  but  to  all  EnSight  users. 


3.  PERFORMANCE  ANALYSIS 


In  order  to  maximize  the  performance  of  any  code,  developers  should,  in  the  following  order,  closely 
examine: 


(1)  Algorithm  Algorithm  modifications  have  the  largest  potential  for  sigruficant  performance 

improvements.  If  the  logic  is  not  sound,  performance  increases  are  difficult  to 
gain  through  other  means. 

(2)  Tuning  Significant  performance  improvements  are  possible  by  paying  close  attention  to 

the  hardware  architecture  in  order  to  take  full  advantage  of  memory  and  CPU 
design.  Unfortunately,  this  tuning  process  usually  requires  one  to  have  a 
significant  knowledge  of  the  hardware  and  is  thus  usually  beyond  the  scope  of 
most  iqrplication  developers. 

(3)  Parallelization  Significant  performance  improvements  are  possible  if  the  tq)plication  code  contains 

areas  that  can  be  performed  in  parallel.  A  code  can  usually  be  executed  in  parallel 
if  each  processor  can  woik  completely  independent  of  any  other  processor  and 
care  is  taken  to  resolve  any  input/output  (VO)  performed  in  the  parallel  region. 


In  our  work  with  EnSight,  we  have  closely  examined  and  implemented  modifications  to  enhance 
perfoimance  through  algorithm  changes  and  implementation  of  parallel  regions.  Each  of  these  tasks  is 
examined  in  more  detail  in  the  following  sections. 


4.  CHANGES  TO  THE  ALGORITHM 

In  order  to  jM'oduce  the  streamlines,  the  first  task  is  to  find  an  element  that  contains  the  user-defined 
emitter  location.  We  call  this  task  a  "global  search."  Through  our  own  knowledge,  and  confirmed  by 
using  performance  monitoring,  this  global  search  is  the  most  time-consuming  task  of  the  streamline 
generation  and  is  the  bottleneck  in  gaining  performance  that  allows  good  interactivity.  The  algorithm  in 
use  had  been  designed  for  small  memory  vector  architectures  (i.e.,  Cray),  where  memory  use  was 
ultimately  more  important  than  performance  (though  the  performance  was  not  bad  on  this  architecture 
since  the  algorithm  vectorized).  This  algorithm  did  not  run  well  on  RISC  systems,  especially  if 
performance  was  the  goal. 
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The  global  search  algorithm  was  completely  rewritten  with  the  goal  of  maximizing  performance  while 
keeping  the  memoiy  usage  as  ccmservative  as  possible.  The  new  algorithm  calculates  the  two  bounding 
largest  coordinates  of  each  computational  cell,  packs  the  minimum/maximum  values  into  one  word  of 
memory,  and  sorts  the  lists  by  the  minimum  values.  A  global  search  using  this  information  can  be 
performed  by  simply  checking  for  bounding  values  in  each  look-up  array.  Techniques  are  also 
implemented  to  quickly  search  the  arrays  and  take  advantage  of  the  previous  search  by  examining 
bandwidth  parameters. 

5.  PARALLELIZATION 

Restructuring  of  the  streamline  algorithm  was  necessary  in  order  to  eliminate  memoiy  allocation, 
memoiy  conflicts,  and  I/O  operations — ^aU  of  which  can  cause  significant  problems  in  creating  parallel 
streamlines.  After  this  task,  we  hand-coded  the  directives  necessary  to  force  the  parallel  region  to  execute 
in  parallel  (the  SGI  parallel  C  compiler  would  not  make  our  do-loop  run  in  parallel  because  of  perceived 
dependencies  due  to  fimction  calls).  As  each  streamline  is  calculated,  it  is  communicated  to  the  EnSight 
client  for  display.  This  communication  is  performed  in  a  "critical"  area  that  allows  only  one  processor 
at  a  time  to  execute.  At  most,  12  processors  are  allowed  to  run  the  parallel  region. 

6.  RESULTS 

In  order  to  test  our  enhancemraits,  ARL  allowed  us  dedicated  use  of  one  of  the  Power  Challenge  Array 
boxes  to  measure  dedicated  wall-clock  time  during  testing.  In  order  to  eliminate  any  network  load 
fluctuations  that  could  invalidate  our  testing,  we  eliminated  aU  output  and  display  of  the  streamlines  (i.e., 
we  measured  only  the  wall-clock  time  to  compute  the  traces).  All  testing  was  performed  with  the  same 
data  set,  using  a  different  number  of  emitter  points  and  a  different  number  of  processors.  For  comparison, 
we  also  modified  EnSight  5.5  to  eliminate  its  output  and  display  and  tested  it  with  the  same  data  set  and 
emitter  points. 

The  data  set  tested  is  shown  in  Figure  1  and  was  generated  at  ARL  by  Mr.  Bernard  J.  Guidos  of  the 
Weapons  Technology  Directorate,  Propulsion  and  Flight  Division,  Aerodynamics  Branch.  The 
computational  grid  provided  contains  663,000  grid  points.  The  area  of  interest  is  in  the  wake  area  of  the 
projectile  (shown  in  Figure  2).  This  wake  area  has  been  modeled  using  two  separate  data  blocks  where 
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the  mesh  density  is  different  between  the  two  blocks.  Streamlines  are  started  in  the  left  data  blocks  as 
shown  and  must  cross  the  discontinuity  between  the  two  blocks.  When  the  data  are  converted  to  the 
internal  EnSight  data  format,  the  problem  contains  628,170  hexahedral  elements. 

Table  1  shows  the  results  of  varying  the  munber  of  emission  points  and  number  of  processors,  as  well 
as  the  timing  values  for  EnSight  5.5.  All  timings  are  in  wall-clock  seconds. 


Table  1.  Results  of  Benchmark  Testing 


No.  of  Bnission 
Points 

Version 

5.5 

(s) 

Version  5.5.1 

1  CPU’s 
(s) 

2  CPU’s 
(s) 

4  CPU’s 
(s) 

8  CPU’s 
(s) 

12  CPU’s 
(s) 

3 

29.51 

0.07 

0.05 

0.03 

0.04 

0.04 

10 

82.68 

0.29 

0.15 

0.15 

0.12 

0.14 

29 

205.97 

0.96 

0.64 

0.43 

0.34 

0.34 

75 

NA 

2.68 

1.80 

1.28 

0.75 

0.75 

118 

915.18 

4.97 

3.70 

2.10 

1.73 

1.57 

173 

NA 

7.38 

5.64 

3.39 

2.48 

2.20 

214 

1,646.75 

8.35 

6.23 

3.53 

2.67 

1.88 

342 

NA 

21.47 

18.19 

13.18 

4.90 

4.20 

1,395 

NA 

54.72 

41.68 

26.08 

15.35 

10.87 

3,165 

NA 

123.50 

94.79 

58.64 

29.94 

24.57 

5,651 

NA 

220.60 

173.29 

104.67 

55.02 

39.20 

NA  =  Not  Attempted 


The  appendix  shows  the  results  in  Table  1  in  graphical  form. 

7.  ANALYSIS 


Each  test  modifies  the  number  of  emission  points.  As  the  emission  points  are  not  in  the  same  location 
for  each  test,  care  must  be  taken  in  drawing  direct  correlation  between  tests  (i.e.,  if  you  double  the  number 
of  emission  points  [using  one  processor],  you  do  not  necessarily  double  the  waU-clock  time  since  the 
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streamlines  are  emitting  from  different  locations  in  each  test).  If  the  mesh  and  velocity  field  were 
completely  unifonn,  these  comparisons  could  be  made. 

The  modified  global  search  algorithm  in  EnSight  5.5.1  has  increased  performance  by  approximately 
200  times  over  the  old  algorithm  as  implemented  in  version  5.5.  It  must  be  remembered  that  this 
performance  increase  comes  at  the  expense  of  ^proximately  7.5  MB  of  memory  (for  a  data  set  of  628k 
elements).  In  our  opinion,  this  added  memory  use  is  an  excellent  tradeoff  given  this  substantial 
performance  increase. 

Some  of  the  performance  gains  between  EnSight  5.5  and  5.5.1  can  probably  be  attributed  to  compiler 
differences.  While  both  versions  have  been  compiled  with  optimization,  version  5.5  was  compiled  to  take 
advantage  of  the  mips2  instruction  set,  while  the  5.5.1  version  tested  used  mips4. 

Adding  processors  does  not  always  result  in  a  good  payoff  in  terms  of  performance.  The  performance 
gains  depend  to  a  large  degree  on  how  busy  die  processors  can  be  kept  and  over  how  much  time.  The 
best  parallel  speedup  is  achieved  where  there  are  longer  run  times  overcoming  the  operating  system  setup 
of  starting  the  parallel  region.  Eight  processors  give  us  a  speedup  of  3-4  times  when  compared  to  one 
processor.  To  put  this  in  perspective,  eight  processors  give  us  an  approximate  performance  increase  of 
6(X)  times  when  combined  widi  the  modified  search  global  algorithm  and  when  compared  to  the  5.5 
release  for  the  test  containing  214  emission  points.  Having  more  processors  than  particle  traces  does  not 
help  (see  the  results  for  three  particle  traces). 

There  appears  to  be  significant  overhead  of  starting  a  parallel  region.  This  overhead  appears  to  be 
the  same  regardless  of  the  number  of  processors  used.  This  is  evident  by  examining  the  relatively  modest 
performance  increase  in  going  fiom  1  to  2  processors  compared  to  the  performance  between  2  and  4,  8, 
or  12  processors. 

The  performance  increase  when  adding  additional  processors  is  nonlinear.  This  is  probably  due  to  one 
or  more  of  the  following  factors  (and,  possibly,  other  factors): 
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(1)  Load  Balance  In  order  to  get  a  perfect  speedup  in  a  parallel  region,  each  processor  must  do  the 

exact  same  amount  of  woric  (i.e.,  no  processor  must  be  idle).  For  most 
applications,  this  is  almost  impossible  to  do,  resulting  in  a  load  imbalance. 

EnSight  uses  an  SGI-spedfic,  load-balancing  mechanism  called  gss  (guided  self¬ 
scheduling),  which  behaves  best  when  each  iteration  of  the  parallel  loop  varies 
*  only  slightly  from  any  other  iteration.  For  this  particular  data  set,  fliis  is  not 

always  true.  The  streamlines  immediately  behind  the  projectile  take  longer  to 
compute  than  die  rest  of  the  traces — ^probably  significantly  longer.  This  may 
create  a  condition  in  which  some  of  the  processors  are  idle,  waiting  for  others  to 
finish. 

(2)  Local  Searches  If  a  global  search  has  already  been  performed,  EnSight  can  use  die  result  (the 

element  number  containing  the  emitter)  in  starting  a  new  streamline  by 
performing  a  local  search.  A  local  search  starts  at  the  last  known  location  and 
uses  the  neighboring  element  table  and  the  element  shape  function  to  find  the 
new  element  containing  the  emitter.  This  algorithm  works  great  as  long  as  the 
last  location  is  relatively  close  to  the  new  location — a  condition  which  is  ^ical 
if  you  run  on  one  processor  and  you  are  creating  several  streamlines.  However, 
this  condition  is  probably  violated  when  multiple  processors  are  engaged  due  to 
the  way  the  workload  is  split  among  the  processors. 

(3)  Memory  In  a  globally  shared  memory  system  (such  as  the  SGI),  memory  contention  is  a 

real  possibility  when  utilizing  multiple  processors.  This  will  cause  some  of  the 
processors  to  wait  for  memory  longer  than  they  would  if  only  one  processor  was 
being  utilized. 


One  must  keep  in  mind  when  viewing  the  results  of  the  performance  test  that  the  timings  only  indicate 
the  time  required  to  graerate  the  streamlines  once.  When  performing  streamline  gp.nftratinn  in  a  "real- 
time"  environment,  the  particles  are  continuously  updated  as  the  user  moves  the  emitter  location.  For 
example,  the  test  of  29  particle  traces  would  be  capable  of  generating  3  frames/second  when  using 
8  processors,  but  would  only  generate  1  firame/second  when  using  1  processor.  In  terms  of  user 
interactivity,  this  is  a  significant  difference. 


One  might  be  tempted  to  say  that  the  larger  tests,  with  hundreds  or  thousands  of  emitters,  are  purely 
academic  since  no  one  would  want  to  view  the  complicated  picture  which  results.  However,  the  curved 
vector  arrow  function  in  EnSight  also  uses  this  same  algorithm — ^it  is  routinely  used  to  calmlate  curved 
arrows  at  the  number  of  emission  points  used  in  these  tests. 

Finally,  the  results  of  this  test  show  only  a  trend,  as  all  of  the  factors  described  previously  will  cause 
all  data  sets  to  behave  somewhat  differently. 
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8.  CONCLUSIONS 


By  taking  advantage  of  new  algorithms  and  parallelization,  we  have  shown  that  real-time  particle 
tracing  is  possible  in  data  sets  typical  of  those  being  computed  at  ARL.  The  performance  increase  has 
been  so  dramatic  that  it  is  not  only  a  difference  in  performance,  but  a  difference  in  ability  (i.e.,  we  can 
now  process  tasks  that  were  impossible  in  the  previous  release  of  EnSigJit  [see  preceding  tests  comparing 
wall-clock  time  for  214  emitters]). 

Our  experience  witii  this  project  has  shown  that  there  are  significant  opportunities  for  performance 
improvements  in  EnSight.  In  the  future  we  hope  to  expand  our  woik  into  other  EnSight  fimctionalities 
and  possibly  investigate  performance  tuning  and  code  in-lining. 
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8725  JOHN  J  KINGMAN  RD 
STE0944 

FT  BELVOIR  VA  22060-6218 

1  DIRECTOR 

US  ARMY  RESEARCH  LAB 
ATTN  AMSRL  OP  SD  TA 
2800  POWDER  MILL  RD 
ADELPm  MD  20783-1145 

3  DIRECTOR 

US  ARMY  RESEARCH  LAB 
ATTN  AMSRL  OP  SD  TL 
2800  POWDER  MILL  RD 
ADELPm  MD  20783-1145 

1  DIRECTOR 

US  ARMY  RESEARCH  LAB 
ATTN  AMSRL  OP  SD  TP 
2800  POWDER  MILL  RD 
ADELPm  MD  20783-1145 


ABERDEEN  PROVING  GROUND 

2  DIR  USARL 

ATTN  AMSRL  OP  AP  L  (305) 


15 


NO.  OF 

COPIES  ORGANIZATION 


2  COMPUTATIONAL  ENGRNG INTL 
ATTN  DR  ANDERS  GRIMSRUD 
PO  BOX  14306 

RESEARCH  TRIANGLE  PARK  NC 
27709 


ABERDEEN  PROVING  GROUND 

12  DIR,  USARL 

ATTN:  AMSRL-SC,  W.  MERMAGEN 

AMSRL-SC-C, 

C.  NIETUBICZ 
M.  WILLIAMS 

AMSRL-SC-CC, 

D.  THOMPSON 
R- ANGEUNI 
J.HARE 

J.  GROSH 
AMSRL-WT-PB, 

B.  GUIDOS 
H.EDGE 

E.  FERRY 

K.  HEAVEY 

AMSRL-WT-PD,  D.  HOPKINS 


USER  EVALUATION  SHEET/CHANGE  OF  ADDRESS 


This  Laboratory  imdertakes  a  continuing  effort  to  improve  the  quality  of  the  reports  it  publishes.  Your  comments/answers 
to  the  items/questions  below  will  aid  us  in  our  efforts. 

1.  ARL  Report  Nmnbei/Author  ARL-CR-298  (Grimsrud’) _ Date  of  Report  August  1996 

2.  Date  Report  Received _ 

3.  Does  this  report  satisfy  a  need?  (Comment  on  purpose,  related  project,  or  other  area  of  interest  for  which  the  report 

will  be  used.) _ 


4.  Specifically,  how  is  the  report  being  used?  (Information  source,  design  data,  procedure,  source  of  ideas,  etc.) 


S.  Has  the  information  in  this  report  led  to  any  quantitative  savings  as  far  as  man-hours  or  dollars  saved,  operating  costs 
avoided,  or  efSciencies  achieved,  etc?  If  so,  please  elaborate. _ 


6.  General  Comments.  What  do  you  think  should  be  changed  to  improve  future  reports?  (Indicate  changes  to 
organization,  technical  content,  format,  etc.) _ 


Organization 

CURRENT  Name 

ADDRESS  _ 

Street  or  P.O.  Box  No. 

City,  State,  Zip  Code 

7.  If  indicating  a  Change  of  Address  or  Address  Correction,  please  provide  the  Current  or  Correct  address  above  and  the 
Old  or  Incorrect  address  below. 


Organization 


OLD  Name 

ADDRESS  _ 

Street  or  P.O.  Box  No. 


City,  State,  Zip  Code 


(Remove  this  sheet,  fold  as  indicated,  tape  closed,  and  mail.) 
(DO  NOT  STAPLE) 


DEPARTMENT  OF  THE  ARMY 


OFFICIAL  BUSINESS 


BUSINESS  REPLY  MAIL 

FIRST  CLASS  PERMIT  NO  0001,APG,MD 


POSTAGE  WILL  BE  PAID  BY  ADDRESSEE 


DIRECTOR 

U.S.  ARMY  RESEARCH  LABORATORY 
ATTN:  AMSRL-SC-CC 

ABERDEEN  PROVING  GROUND,  MD  21005-5067 


